T-distributed Stochastic Neighbor Embedding

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is based on Stochastic Neighbor Embedding, originally developed by Sam Roweis and Geoffrey Hinton, for which Laurens van der Maaten proposed the ''t''-distributed variant. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.

t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing.

While t-SNE plots often seem to display clusters, the visual clusters can be influenced strongly by the chosen parameterization, so a good understanding of the parameters of t-SNE is necessary. Such "clusters" can be shown to appear even in non-clustered data, and thus may be false findings. Interactive exploration may therefore be necessary to choose parameters and validate results. It has been demonstrated that t-SNE is often able to recover well-separated clusters, and that, with special parameter choices, it approximates a simple form of spectral clustering.
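This parameter sensitivity can be explored directly. The sketch below (assuming the scikit-learn and NumPy packages are available; the two-blob dataset and the perplexity values are arbitrary demo choices) embeds the same data at two different perplexities so that the resulting maps can be compared:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two well-separated Gaussian blobs in 50 dimensions, 50 points each.
X = np.vstack([rng.normal(0.0, 1.0, (50, 50)),
               rng.normal(10.0, 1.0, (50, 50))])

# Embed the same data under different perplexity settings; the apparent
# cluster structure of the 2-D map can change with this parameter.
embeddings = {}
for perplexity in (5, 30):
    embeddings[perplexity] = TSNE(n_components=2, perplexity=perplexity,
                                  init="pca", random_state=0).fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Plotting each embedding (e.g. with matplotlib) and comparing them side by side is the usual way to judge whether an apparent cluster is stable across parameter choices.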


Details

Given a set of N high-dimensional objects \mathbf{x}_1, \dots, \mathbf{x}_N, t-SNE first computes probabilities p_{ij} that are proportional to the similarity of objects \mathbf{x}_i and \mathbf{x}_j, as follows. For i \neq j, define

: p_{j\mid i} = \frac{\exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)}

and set p_{i\mid i} = 0. Note that \sum_j p_{j\mid i} = 1 for all i. As Van der Maaten and Hinton explained: "The similarity of datapoint x_j to datapoint x_i is the conditional probability, p_{j\mid i}, that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i." Now define

: p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}

This symmetrization is motivated as follows: since the marginal probability of each datapoint is estimated as 1/N, the directed joint probabilities are p_{j\mid i}/N and p_{i\mid j}/N, and averaging the two directions yields a symmetric measure with p_{ij} = p_{ji}. Also note that p_{ii} = 0 and \sum_{i,j} p_{ij} = 1. The bandwidth of the
Gaussian kernels \sigma_i is set in such a way that the entropy of the conditional distribution equals a predefined entropy, using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of \sigma_i are used in denser parts of the data space. Since the Gaussian kernel uses the Euclidean distance \lVert \mathbf{x}_i - \mathbf{x}_j \rVert, it is affected by the curse of dimensionality: in high-dimensional data, when distances lose the ability to discriminate, the p_{ij} become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension
of each point, to alleviate this. t-SNE aims to learn a d-dimensional map \mathbf{y}_1, \dots, \mathbf{y}_N (with \mathbf{y}_i \in \mathbb{R}^d and d typically chosen as 2 or 3) that reflects the similarities p_{ij} as well as possible. To this end, it measures similarities q_{ij} between two points in the map \mathbf{y}_i and \mathbf{y}_j, using a very similar approach. Specifically, for i \neq j, define q_{ij} as

: q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k} \sum_{l \neq k} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}}

and set q_{ii} = 0. Herein a heavy-tailed
Student ''t''-distribution (with one degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points, in order to allow dissimilar objects to be modeled far apart in the map. The locations of the points \mathbf{y}_i in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P from the distribution Q, that is:

: \mathrm{KL}\left(P \parallel Q\right) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

The minimization of the Kullback–Leibler divergence with respect to the points \mathbf{y}_i is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
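The construction above (conditional Gaussians with bandwidths calibrated by bisection, the symmetrized p_{ij}, the Student-t q_{ij}, and gradient descent on the KL divergence) can be sketched in NumPy. This is a minimal illustration rather than an optimized implementation; the two-cluster data, the perplexity, the learning rate, and the iteration count are all arbitrary demo choices, and the gradient uses the standard closed form dC/dy_i = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j \rVert^2)^{-1}.

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances."""
    s = np.sum(X ** 2, axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)

def cond_p_row(d_row, i, sigma):
    """Conditional p_{j|i} for point i from its squared distances."""
    p = np.exp(-d_row / (2.0 * sigma ** 2))
    p[i] = 0.0
    s = p.sum()
    return p / s if s > 0 else p

def find_sigma(d_row, i, target_entropy, lo=1e-10, hi=1e4, iters=60):
    """Bisect on sigma_i until the entropy of p_{.|i} matches the target
    (the target is log(perplexity); entropy grows with sigma)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p = cond_p_row(d_row, i, mid)
        h = -np.sum(p[p > 0] * np.log(p[p > 0]))
        if h > target_entropy:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def joint_p(X, perplexity):
    """Symmetrized p_ij = (p_{j|i} + p_{i|j}) / (2N)."""
    n = X.shape[0]
    D = sq_dists(X)
    P = np.zeros((n, n))
    for i in range(n):
        P[i] = cond_p_row(D[i], i, find_sigma(D[i], i, np.log(perplexity)))
    return (P + P.T) / (2.0 * n)

def joint_q(Y):
    """Student-t (1 d.o.f.) similarities q_ij in the low-dimensional map."""
    num = 1.0 / (1.0 + sq_dists(Y))
    np.fill_diagonal(num, 0.0)
    return num / num.sum(), num

def kl_divergence(P, Q):
    mask = (P > 0) & (Q > 0)
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def gradient(P, Q, num, Y):
    """dC/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1."""
    A = (P - Q) * num
    return 4.0 * ((np.diag(A.sum(axis=1)) - A) @ Y)

# Demo: two 10-dimensional clusters mapped to 2-D by plain gradient descent.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (15, 10)),
               rng.normal(8.0, 1.0, (15, 10))])
P = joint_p(X, perplexity=5.0)
Y = rng.normal(0.0, 1e-2, (30, 2))       # small random initial map
Q, num = joint_q(Y)
kl_before = kl_divergence(P, Q)
for _ in range(300):
    Q, num = joint_q(Y)
    Y -= 10.0 * gradient(P, Q, num, Y)   # fixed learning rate, no momentum
Q, num = joint_q(Y)
kl_after = kl_divergence(P, Q)
print(kl_before, kl_after)
```

Production implementations add momentum, early exaggeration of the p_{ij}, and (for large N) the Barnes–Hut approximation, but the objective and gradient are exactly those above.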


Software

* The R package Rtsne implements t-SNE in R.
* ELKI contains t-SNE, also with Barnes-Hut approximation.
* scikit-learn, a popular machine learning library in Python, implements t-SNE with both exact solutions and the Barnes-Hut approximation.
* Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (an online version is available).
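As a minimal sketch of the scikit-learn entry above (assuming scikit-learn and NumPy are installed; the random data here are purely illustrative), both solvers are selected through the `method` parameter:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))  # random illustrative data

# method="exact" evaluates all N^2 pairwise interactions;
# method="barnes_hut" (the default) approximates them in O(N log N)
# and is restricted to embeddings of dimension 2 or 3.
emb_exact = TSNE(n_components=2, method="exact", perplexity=10,
                 random_state=0).fit_transform(X)
emb_bh = TSNE(n_components=2, method="barnes_hut", perplexity=10,
              random_state=0).fit_transform(X)
print(emb_exact.shape, emb_bh.shape)
```

The exact solver is useful for small datasets or for validating the approximation; Barnes-Hut is the practical choice once N reaches the thousands.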


References

{{reflist}}


External links


* Visualizing Data Using t-SNE, Google Tech Talk about t-SNE
* Implementations of t-SNE in various languages, a link collection maintained by Laurens van der Maaten